AITopics | suffix array

Collaborating Authors

suffix array

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

Yoneda, Masataka, Matsushita, Yusuke, Kamoda, Go, Suenaga, Kohei, Akiba, Takuya, Waga, Masaki, Yokoi, Sho

arXiv.org Machine LearningFeb-12-2026

We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.

large language model, machine learning, pattern recognition, (25 more...)

arXiv.org Machine Learning

2602.10908

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Austria > Styria > Graz (0.04)
Europe > Austria > Vienna (0.04)
(17 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Leisure & Entertainment > Sports > Olympic Games (0.95)
Health & Medicine > Therapeutic Area > Immunology (0.92)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Add feedback

Beat the long tail: Distribution-Aware Speculative Decoding for RL Training

Shao, Zelei, Srivatsa, Vikranth, Srivastava, Sanjana, Wu, Qingyang, Ariyak, Alpay, Wu, Xiaoxia, Patel, Ameen, Wang, Jue, Liang, Percy, Dao, Tri, Zhang, Ce, Zhang, Yiying, Athiwaratkun, Ben, Xu, Chenfeng, Wang, Junxiong

arXiv.org Artificial IntelligenceNov-19-2025

Reinforcement learning(RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck:the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall clock time and a complementary opportunity; the availability of historical rollouts that reveal stable prompt level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length aware speculation policy that allocates more aggressive draft budgets to long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base and token level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post training without compromising learning quality.

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2511.13841

Country: North America > United States > California (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)

Add feedback

CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding

Ho, Sophia, Park, Jinsol, Wang, Patrick

arXiv.org Artificial IntelligenceAug-7-2024

We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x Although REST can achieve a high draft token acceptance rate, the static nature of the datastore introduces a new challenge Recently, Speculative Decoding has gained traction for accelerating regarding storage space.

crest, dataset, datastore, (13 more...)

arXiv.org Artificial Intelligence

2408.04678

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Liu, Jiacheng, Min, Sewon, Zettlemoyer, Luke, Choi, Yejin, Hajishirzi, Hannaneh

arXiv.org Artificial IntelligenceJan-30-2024

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more study on how to best use verbatim information retrieved from large text corpora.

reference data, suffix array, training data, (16 more...)

arXiv.org Artificial Intelligence

2401.17377

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Data Structures and Algorithms Full Course 【𝙁𝙧𝙚𝙚】

#artificialintelligenceMay-27-2022, 16:00:45 GMT

Data Structures and Algorithms - A data structure is a named location that can be used to store and organize data. And, an algorithm is a collection of steps to

data structure, hash table, source code, (11 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.58)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.41)

Add feedback

A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform

Randazzo, Mario, Rombo, Simona E.

arXiv.org Artificial IntelligenceJul-20-2020

Precision Medicine aims to design individualized strategies for diagnostic or therapeutic decision-making, based on both genotypic and phenotypic information. It allows scientists and clinicians to understand which therapeutic and preventive approaches to a specific illness can work effectively in subgroups of patients based on their genetic makeup, lifestyle, and environmental factors [15]. The diffusion of high-throughput assays, such as next-generation sequencing (NGS) and mass spectrometry (MS), has led to fast accumulation of sequences and other omics data which can be used to enable Precision Medicine in practice. As an example, specific disease biomarkers may be identified by cleaning up raw data generated by NGS or MS, and then experimentally validated in laboratory. An important problem in this context is the indexing of NGS data [8].

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2007.10095

Country:

Africa > Eswatini > Manzini > Manzini (0.05)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining > Big Data (0.69)
Information Technology > Artificial Intelligence > Machine Learning (0.47)

Add feedback

40 Years of Suffix Trees

Communications of the ACMMar-24-2016, 13:28:47 GMT

When William Legrand finally decrypted the string, it did not seem to make much more sense than it did before. But at least it did sound more like natural language, and eventually guided the main character of Edgar Allan Poe's "The Gold-Bug"36 to discover the treasure he had been after. Legrand solved a substitution cipher using symbol frequencies. He first looked for the most frequent symbol and changed it into the most frequent letter of English, then similarly inferred the most frequent word, then punctuation marks, and so on. Both before and after 1843, the natural impulse when faced with some mysterious message has been to count frequencies of individual tokens or subassemblies in search of a clue. Perhaps one of the most intense and fascinating subjects for this kind of scrutiny have been biosequences. As soon as some such sequences became available, statistical analysts tried to link characters or blocks of characters to relevant biological functions.

artificial intelligence, natural language, text processing, (17 more...)

Communications of the ACM

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Africa > Eswatini > Manzini > Manzini (0.05)
(11 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Add feedback

Unsupervised Phrasal Near-Synonym Generation from Text Corpora

Gupta, Dishan (Carnegie Mellon University) | Carbonell, Jaime (Carnegie Mellon University) | Gershman, Anatole (Carnegie Mellon University) | Klein, Steve (Meaningful Machines, LLC) | Miller, David (Meaningful Machines, LLC)

AAAI ConferencesMar-6-2015

Unsupervised discovery of synonymous phrases is useful in a variety of tasks ranging from text mining and search engines to semantic analysis and machine translation. This paper presents an unsupervised corpus-based conditional model: Near-Synonym System (NeSS) for finding phrasal synonyms and near synonyms that requires only a large monolingual corpus. The method is based on maximizing information-theoretic combinations of shared contexts and is parallelizable for large-scale processing. An evaluation framework with crowd-sourced judgments is proposed and results are compared with alternate methods, demonstrating considerably superior results to the literature and to thesaurus look up for multi-word phrases. Moreover, the results show that the statistical scoring functions and overall scalability of the system are more important than language specific NLP tools. The method is language-independent and practically useable due to accuracy and real-time performance via parallel decomposition.

artificial intelligence, information retrieval, natural language, (17 more...)

AAAI Conferences

Twenty-Ninth AAAI Conference on Artificial Intelligence

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.15)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)

Add feedback

Fast Computation of Subpath Kernel for Trees

Kimura, Daisuke, Kashima, Hisashi

arXiv.org Machine LearningJun-18-2012

The kernel method is a potential approach to analyzing structured data such as sequences, trees, and graphs; however, unordered trees have not been investigated extensively. Kimura et al. (2011) proposed a kernel function for unordered trees on the basis of their subpaths, which are vertical substructures of trees responsible for hierarchical information in them. Their kernel exhibits practically good performance in terms of accuracy and speed; however, linear-time computation is not guaranteed theoretically, unlike the case of the other unordered tree kernel proposed by Vishwanathan and Smola (2003). In this paper, we propose a theoretically guaranteed linear-time kernel computation algorithm that is practically fast, and we present an efficient prediction algorithm whose running time depends only on the size of the input tree. Experimental results show that the proposed algorithms are quite efficient in practice.

algorithm, artificial intelligence, machine learning, (13 more...)

arXiv.org Machine Learning

1206.4642

Country: Asia > Japan > Honshū (0.14)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Add feedback